First we will need to load a couple R libraries we will need in order to manipulate our data.
suppressMessages(library(tidyverse))
First things first, lets read in our datasets. If you haven’t downloaded the datasets yet, please do so from my github.
course_meta = read_tsv("./datasets/course-meta.csv")
## Parsed with column specification:
## cols(
## catch = col_datetime(format = ""),
## id = col_character(),
## firstClear = col_character(),
## tag = col_character(),
## stars = col_double(),
## players = col_double(),
## tweets = col_double(),
## clears = col_double(),
## attempts = col_double(),
## clearRate = col_double()
## )
course = read_tsv("./datasets/courses.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## difficulty = col_character(),
## gameStyle = col_character(),
## maker = col_character(),
## title = col_character(),
## thumbnail = col_character(),
## image = col_character(),
## creation = col_datetime(format = "")
## )
colnames(course_meta)
## [1] "catch" "id" "firstClear" "tag" "stars"
## [6] "players" "tweets" "clears" "attempts" "clearRate"
colnames(course)
## [1] "id" "difficulty" "gameStyle" "maker" "title"
## [6] "thumbnail" "image" "creation"
As we can see the column names of our datasets are pretty well named except for ‘catch’ which is what they use for timestamp, so we will change that.
colnames(course_meta)[colnames(course_meta) == "catch"] = "timestamp"
The data is also already pretty well organized and doesn’t seem to violate any of the rules of tidy data.Each row contains one sample, each column describes one aspect, and each value has its own cell. We can thank our source for that! Now lets move on and begin to visualize our data a bit.
Now lets go ahead and take a look at our tidy datasets.
course
course_meta
Here we can see how many of each course was made categorized by difficulty.
(difficulty_total_plot = ggplot(data=course) + geom_bar(aes(x=course$difficulty), width=.3, fill='blue')
+ xlab("Difficulty") + ylab("Count"))
Next, lets take a look at how many times each course was played based on its difficulty. First lets get the total number of attempts for each course. Since our course_meta table rows contain how many attempts each person has done on a given course, we will want to aggregate the sum of all these attempts, group them by course_id, then append that column to the course table.
course = inner_join(aggregate(course_meta$attempts, by=list(id=course_meta$id), FUN=sum), course, by='id')
colnames(course)[colnames(course) == "x"] = "attempts"
Now lets build our next graph to show attempts based on difficulty.
attempts_vs_diffuculty = aggregate(course$attempts, by=list(course$difficulty), FUN=sum)
colnames(attempts_vs_diffuculty)[colnames(attempts_vs_diffuculty) == 'Group.1'] = "difficulty"
colnames(attempts_vs_diffuculty)[colnames(attempts_vs_diffuculty) == 'x'] = "count"
(attempts_vs_diffuculty_plot = ggplot(data=attempts_vs_diffuculty, aes(difficulty, count) ) + geom_bar(width=.3, fill = 'blue', stat='identity')) + labs(x="Difficulty",y="Attempts")
As we can see, courses with the superExpert tag have about 4x as many attempts that any other course difficulty. When we think about this, it is expected since superExpert courses are much harder thus require many more attempts to beat that level.
Moving forward some of the questions I would like to answer is what makes a course a course popular. I plan to answer this by looking into what courses with high amounts of plays have in common. Is there anything specifically that contributes to the likeness of a course being higher than others? As of now, it’s hard to say what exactly could lead to this, but I’m guessing who made the course and what difficulty it is will contribute a lot to its likeness.